Instacart is a grocery delivery startup that delivers on-demand grocery orders to homes within an hour in major US cities. As of June 2017, the company was valued at $3.4 billion. Instacart recently published a dataset of grocery orders that is well suited to market basket analysis. In this post, I look for interesting patterns that reveal which products customers tend to purchase together.
The dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users, and it is available for public download.
Apriori is an algorithm that mines transaction databases for frequent itemsets, that is, groups of items that often appear together in the same transaction. This makes it very useful for analyzing shopping carts and finding interesting patterns. I will be using the arules package in R, which provides an implementation of Apriori.
Using the apriori algorithm, we might find rules that indicate grocery items that are commonly bought together with other items. For instance, the rule {Peanut Butter, Jelly} => {Bread} indicates that customers who purchase Peanut Butter & Jelly are likely to purchase Bread. This information can be very useful for product recommendations, cross marketing, and promotions to increase sales.
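To make the idea of mining frequent itemsets more concrete, here is a minimal base R sketch of the level-wise search that Apriori is built on. It is not how arules implements the algorithm internally. The baskets, the `min_support` threshold, and the helper `support_of` are all made up for illustration.

```r
# Toy baskets (hypothetical data, for illustration only)
baskets <- list(
  c("Peanut Butter", "Jelly", "Bread"),
  c("Peanut Butter", "Bread"),
  c("Jelly", "Bread", "Milk"),
  c("Peanut Butter", "Jelly", "Bread", "Milk"),
  c("Milk", "Eggs")
)
min_support <- 0.3  # assumed threshold: keep itemsets found in at least 30% of baskets

# Share of baskets that contain every item in `itemset`
support_of <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
item_support <- vapply(items, support_of, numeric(1))
frequent_items <- names(item_support)[item_support >= min_support]

# Level 2: candidate pairs are built only from frequent single items,
# because a pair cannot be frequent if one of its members is not
candidate_pairs <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- vapply(candidate_pairs, support_of, numeric(1))
names(pair_support) <- vapply(candidate_pairs, paste, character(1), collapse = " + ")
frequent_pairs <- pair_support[pair_support >= min_support]

print(frequent_items)
print(frequent_pairs)
```

Because Eggs falls below the threshold at level 1, no pair containing Eggs is ever counted. That pruning step is what keeps the search manageable on tens of thousands of products.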
Three measures are key to understanding the apriori algorithm: Support, Confidence, and Lift.
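Support is an indication of how frequently the itemset appears in the dataset: the fraction of all transactions that contain it.
Confidence is an indication of how often the rule has been found to be true: of the transactions that contain the left-hand side, the fraction that also contain the right-hand side.
Lift compares the rule's confidence with how often the right-hand side appears overall. A lift of 1 means the two sides occur independently of each other; a lift greater than 1 lets us know the degree to which they occur together more often than chance alone would predict.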
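The three measures can be worked out by hand on a tiny example. The base R sketch below uses five made-up baskets and a hypothetical helper, `support_of`, to compute them for the rule {Peanut Butter, Jelly} => {Bread}. arules computes the same quantities for us on the real data.

```r
# Toy baskets (hypothetical data, for illustration only)
baskets <- list(
  c("Peanut Butter", "Jelly", "Bread"),
  c("Peanut Butter", "Jelly", "Bread", "Milk"),
  c("Peanut Butter", "Jelly"),
  c("Bread", "Milk"),
  c("Milk", "Eggs")
)

# Share of baskets that contain every item in `itemset`
support_of <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

lhs <- c("Peanut Butter", "Jelly")
rhs <- "Bread"

rule_support    <- support_of(c(lhs, rhs))            # share of baskets with lhs and rhs
rule_confidence <- rule_support / support_of(lhs)     # share of lhs baskets that also hold rhs
rule_lift       <- rule_confidence / support_of(rhs)  # confidence relative to rhs's base rate

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

Here the rule appears in 2 of 5 baskets (support 0.4), holds in 2 of the 3 baskets containing Peanut Butter and Jelly (confidence about 0.67), and Bread appears in 3 of 5 baskets overall, which gives a lift of about 1.11.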
library(arules)
library(arulesViz)
library(tidyverse)
library(plotly)
order_products = read.csv("order_products_train.csv", nrows = 750000)
products = read.csv("products.csv")
data = left_join(order_products, products, by = "product_id") %>%
  select(order_id, product_name)
data
Convert the dataframe into transactions, as required by the arules package. The transactions format turns the dataframe into a sparse matrix with 39320 transactions (rows) and 30896 items (columns). Each cell in the matrix records 1 if the customer's order contains the product and 0 if not.
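To picture what that matrix looks like, here is a small base R sketch that builds a dense 0/1 version from a long-format table. The `orders` data frame is invented for illustration. arules stores the same information in a sparse structure instead, since almost every cell is 0 when there are tens of thousands of products.

```r
# Long-format orders: one row per (order, product) pair (hypothetical data)
orders <- data.frame(
  order_id     = c(1, 1, 2, 2, 2, 3),
  product_name = c("Banana", "Milk", "Banana", "Bread", "Milk", "Bread"),
  stringsAsFactors = FALSE
)

# Cross-tabulate orders against products, then reduce counts to presence/absence
incidence <- unclass(table(orders$order_id, orders$product_name)) > 0
incidence <- incidence * 1L  # convert TRUE/FALSE to 1/0

print(incidence)
```

Each row is one order and each column is one product, so order 1 has a 1 under Banana and Milk and a 0 under Bread.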
write_csv(data, 'data.csv')
# Read back the file written above; header = TRUE lets cols refer to the columns by name
transactions = read.transactions('data.csv', format = "single", header = TRUE, sep = ",", cols = c("order_id", "product_name"))
transactions
transactions in sparse format with
39320 transactions (rows) and
30896 items (columns)
inspect(transactions[1:5])
items transactionID
[1] {Bag of Organic Bananas,
Bulgarian Yogurt,
Cucumber Kirby,
Lightly Smoked Sardines in Olive Oil,
Organic 4% Milk Fat Whole Milk Cottage Cheese,
Organic Celery Hearts,
Organic Hass Avocado,
Organic Whole String Cheese} 1
[2] {Apple Sauce,
Original Real Crumbled Bacon} 1000162
[3] {ProteinPLUS Multigrain Penne Pasta,
Sesame Topped Hamburger Buns} 1000197
[4] {Curate Melon Pomelo Sparking Water,
Milk, Vitamin D,
Organic Super Fruit Punch Juice Drink,
Sausage Links} 1000209
[5] {Coconut Milk Virgin Chocolate Bar,
Disinfecting Bathroom Cleaner - Lemongrass Citrus,
Green Clary Sage & Citrus All-Purpose Cleaner,
Ramen, Vegan, Miso,
The Ring Vegetable Brush} 1000222
itemFrequencyPlot(transactions, type = "absolute", topN = 20, cex.names = 0.9, border = NA)
Fruits and vegetables are very popular grocery cart items.
# target = "frequent itemsets" makes apriori return itemsets rather than rules
basket_sets = apriori(transactions, parameter = list(support = 0.002, minlen = 2, target = "frequent itemsets"), control = list(verbose = FALSE))
basket_set_dataframe = DATAFRAME(sort(basket_sets, by = "support", decreasing = TRUE))
basket_set_dataframe
par(mar = c(3, 20, 0, 0) + .4)
barplot(basket_set_dataframe$support, names.arg = basket_set_dataframe$items, las = 2, horiz = TRUE, cex.names = 0.9, border = NA, cex.axis = .8)
This is the heart of the analysis, where we set up a rules object by mining association rules with apriori.
rules = apriori(transactions, parameter = list(support = 0.001, confidence = 0.25, minlen = 2))
summary(rules)
set of 522 rules
rule length distribution (lhs + rhs):sizes
2 3 4
155 353 14
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 2.00 3.00 2.73 3.00 4.00
summary of quality measures:
    support            confidence          lift
 Min.   :0.001017   Min.   :0.2500   Min.   : 1.915
 1st Qu.:0.001144   1st Qu.:0.2716   1st Qu.: 2.629
 Median :0.001373   Median :0.3003   Median : 3.751
 Mean   :0.001943   Mean   :0.3248   Mean   : 6.513
 3rd Qu.:0.001850   3rd Qu.:0.3538   3rd Qu.: 5.145
 Max.   :0.021109   Max.   :0.7019   Max.   :91.019
mining info:
data ntransactions support confidence
transactions 39320 0.001 0.25
For example, the first rule below states that customers who bought {Organic Fuji Apples} are also likely to purchase {Bag of Organic Bananas}, with a lift of 3.6. Lift measures how much more likely an item is to be purchased given the other item than it is to be purchased in general, so a higher lift indicates a stronger association. With the table below we can interactively sort the rules by lift, support, and confidence.
inspectDT(rules)
Interactive Visualization
Sifting through rules can be time-consuming. The interactive scatter plot below shows the relationships between lift, confidence, and support at a glance.
plotly_arules(rules)
plot(rules[1:10], method = "graph")